EXPERIMENTING WITH RULE LEARNING FOR INFORMATION EXTRACTION FROM HTML. Presented at 6th Int. Symposium SYNASC04, Timişoara, Romania
Abstract
The Web is a continuously growing information repository with a rich semantic structure that spans many application areas. However, it was designed primarily for human consumption rather than automated processing, which is a major obstacle to automating tasks such as information searching, filtering, and extraction. In this context, this paper presents a technique for learning rules that extract product information from HTML sources representing product information sheets. The technique exploits the fact that the Web pages describing a given producer's products are generated on the fly from the producer's database and therefore exhibit uniform structures. Consequently, once a human user has manually extracted a few information items, a general-purpose inductive learner can learn extraction rules that are then applied to the current and other product information sheets to extract further items automatically. The input to the learning algorithm is a relational description of the HTML document tree that defines the types of the tree nodes and the relationships between them. The approach is demonstrated with examples, experimental results, and software tools.
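A relational description of an HTML document tree can be pictured with a minimal Python sketch. This is an illustration only, not the representation used in the paper: the predicate names `type`, `child`, `next`, and `content`, the node identifiers `n1`, `n2`, ..., and the helper `html_to_facts` are all assumptions introduced here, and the parser assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class RelationalTreeBuilder(HTMLParser):
    """Turns an HTML document into a flat list of relational facts.

    Hypothetical predicates (not taken from the paper):
      ("type", node, kind)      node is an element of that tag, or "text"
      ("child", parent, node)   node is a direct child of parent
      ("next", left, right)     right is the sibling immediately after left
      ("content", node, text)   text carried by a text node
    """

    def __init__(self):
        super().__init__()
        self.facts = []
        self.stack = []        # ids of currently open elements
        self.last_child = {}   # parent id -> id of its most recent child
        self.counter = 0

    def _new_node(self, kind):
        self.counter += 1
        node = f"n{self.counter}"
        parent = self.stack[-1] if self.stack else None
        if parent is not None:
            self.facts.append(("child", parent, node))
            prev = self.last_child.get(parent)
            if prev is not None:
                self.facts.append(("next", prev, node))
            self.last_child[parent] = node
        self.facts.append(("type", node, kind))
        return node

    def handle_starttag(self, tag, attrs):
        node = self._new_node(tag)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Simplification: assumes each end tag closes the innermost open element.
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            node = self._new_node("text")
            self.facts.append(("content", node, text))


def html_to_facts(html):
    """Parse an HTML string and return its list of relational facts."""
    builder = RelationalTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.facts


if __name__ == "__main__":
    sheet = "<table><tr><td>Power</td><td>75 W</td></tr></table>"
    for fact in html_to_facts(sheet):
        print(fact)
```

Facts of this kind could be handed to a relational learner as background knowledge, with the manually extracted items serving as positive examples. A learned rule might then say, for instance, that the target is the text child of the cell that follows a cell whose text is a given label.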
Similar resources
Automated Generation of Loop Invariants by Recurrence Solving in Theorema. Presented at 6th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC04), Timişoara, Romania
Most of the properties established during program verification are either invariants or depend crucially on invariants. The effectiveness of automated verification of (imperative) programs is therefore sensitive to the ease with which invariants, even trivial ones, can be automatically deduced. We present a method for invariant generation that relies on combinatorial techniques, namely on recur...
JADE BASED MULTI-AGENT E-COMMERCE ENVIRONMENT: INITIAL IMPLEMENTATION. Presented at 6th Int. Symposium SYNASC04, Timişoara, Romania
Recent advances in software engineering, business process management, and computational intelligence have resulted in methods and techniques for developing advanced e-commerce applications and for automating e-commerce business processes. Despite this, the most successful e-commerce systems still rely on humans to make the most important decisions in various activit...
Sources of Success for Information Extraction Methods
In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI), by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We provide a systematic analysis of how each algorithmic component of BWI, in particular boosting, contributes to i...
Gleaning answers from the web
A wide variety of valuable textual information resides on the Web, but very little is in a machine-understandable form such as XML. Instead, the content is usually embedded in HTML markup or other encodings designed for human consumption. The information extraction task is to automatically populate a database with content gleaned from information sources such as Web pages. Wrappers are an import...
AutoWrapper: automatic wrapper generation for multiple online services
A crucial challenge for information extraction from the WWW is to generate wrappers, that is, information extraction patterns or rules, that apply to numerous Web sites with great diversity in both format and content. Generating wrappers manually is tedious, time-consuming, and error-prone. Recent research has successfully adapted machine learning technology to generate wrappers for semi-struct...